Segment Anything in 3D with NeRFs

Neural Information Processing Systems

We refer to the proposed solution as SA3D, for Segment Anything in 3D. Users only need to provide a manual segmentation prompt (e.g., rough points) for the target object in a single view, which SAM uses to generate the object's 2D mask in that view.
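The interface described above, a few rough point prompts in one view producing a 2D mask, can be illustrated with a toy stand-in. SAM itself is a learned model requiring downloaded weights; the region-growing function below only mimics the prompt-to-mask interface and is not SA3D's method.

```python
from collections import deque

def mask_from_point_prompt(image, seed, tol=10):
    """Toy stand-in for point-prompted segmentation: grow a binary mask
    from a single (row, col) seed over 4-connected pixels whose intensity
    is within `tol` of the seed pixel. Illustrates the prompt -> 2D mask
    interface only; SAM uses a learned model, not region growing."""
    h, w = len(image), len(image[0])
    sr, sc = seed
    ref = image[sr][sc]
    mask = [[0] * w for _ in range(h)]
    mask[sr][sc] = 1
    q = deque([seed])
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < h and 0 <= nc < w and not mask[nr][nc]
                    and abs(image[nr][nc] - ref) <= tol):
                mask[nr][nc] = 1
                q.append((nr, nc))
    return mask

# A bright 3x3 square on a dark background; the prompt lands inside it.
img = [[0] * 6 for _ in range(6)]
for r in range(1, 4):
    for c in range(2, 5):
        img[r][c] = 100
m = mask_from_point_prompt(img, (2, 3))
```

In SA3D, the resulting single-view mask is then lifted into the NeRF to segment the object in 3D.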







Position Prediction Self-Supervised Learning for Multimodal Satellite Imagery Semantic Segmentation

Waithaka, John, Busogi, Moise

arXiv.org Artificial Intelligence

Semantic segmentation of satellite imagery is crucial for Earth observation applications, but remains constrained by limited labelled training data. While self-supervised pretraining methods like Masked Autoencoders (MAE) have shown promise, they focus on reconstruction rather than localisation, a fundamental aspect of segmentation tasks. We propose adapting LOCA (Location-aware), a position prediction self-supervised learning method, for multimodal satellite imagery semantic segmentation. Our approach addresses the unique challenges of satellite data by extending SatMAE's channel grouping from multispectral to multimodal data, enabling effective handling of multiple modalities, and introducing same-group attention masking to encourage cross-modal interaction during pretraining. The method uses relative patch position prediction, encouraging spatial reasoning for localisation rather than reconstruction. We evaluate our approach on the Sen1Floods11 flood mapping dataset, where it significantly outperforms existing reconstruction-based self-supervised learning methods for satellite imagery. Our results demonstrate that position prediction tasks, when properly adapted for multimodal satellite imagery, learn representations more effective for satellite image semantic segmentation than reconstruction-based approaches.
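The relative patch position prediction task mentioned above can be sketched as a classification problem over patch-offset classes. The label construction below is an assumption based on the abstract's description (query/reference patches on a ViT-style grid), not the paper's exact code.

```python
def relative_position_label(q_idx, r_idx, grid):
    """Map a (query, reference) patch pair to a relative-offset class.
    Patches index into a grid x grid layout; the row/column offsets
    (dr, dc) each range over [-(grid-1), grid-1], giving a total of
    (2*grid - 1)**2 classes for the position-prediction head to predict.
    Assumed construction, illustrating the idea of supervising
    localisation rather than pixel reconstruction."""
    qr, qc = divmod(q_idx, grid)
    rr, rc = divmod(r_idx, grid)
    dr, dc = qr - rr, qc - rc
    return (dr + grid - 1) * (2 * grid - 1) + (dc + grid - 1)

# 14x14 patch grid (ViT-style); query patch directly below the reference.
label = relative_position_label(14, 0, grid=14)   # dr=1, dc=0
center = relative_position_label(0, 0, grid=14)   # zero offset class
```

Because the target depends only on spatial layout, the pretext task rewards spatial reasoning, which is the property the abstract argues transfers better to segmentation than reconstruction objectives.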


DM-OSVP++: One-Shot View Planning Using 3D Diffusion Models for Active RGB-Based Object Reconstruction

Pan, Sicong, Jin, Liren, Huang, Xuying, Stachniss, Cyrill, Popović, Marija, Bennewitz, Maren

arXiv.org Artificial Intelligence

Many autonomous robotic applications depend on accurate 3D models of objects to perform downstream tasks. These include object manipulation in household scenarios (Breyer et al. 2022; Dengler et al. 2023; Jauhri et al. 2024), harvesting and prediction of intervention actions in agriculture (Pan et al. 2023; Lenz et al. 2024; Yao et al. 2024), as well as solving jigsaw puzzles of fragmented frescoes in archaeology (Tsesmelis et al. 2024). For these applications, high-fidelity 3D object representations are critical to enable precise action execution and informed decision-making. When deployed in initially unknown environments, robots are often required to autonomously reconstruct 3D models of objects to understand their geometries, textures, positions, and orientations before taking action. Generating these models typically involves capturing data from multiple viewpoints using onboard sensors such as RGB or depth cameras. Data acquisition solely following predefined or randomly chosen sensor viewpoints is inefficient, as these approaches fail to adapt to the geometry and spatial distribution of the object to be reconstructed. This can lead to inferior reconstruction results, especially when objects are complex and contain self-occlusions. To address this, we propose using active reconstruction strategies, where object-specific sensor viewpoints are planned for data acquisition to achieve high-quality 3D object reconstruction. The key aspect of active reconstruction is view planning for generating viewpoints (Zeng et al. 2020a) that enables the robot to acquire the most informative sensor measurements.
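The core of view planning, choosing viewpoints that yield the most informative measurements, can be illustrated with a greedy coverage baseline. This is a classical next-best-view heuristic for contrast, not the paper's diffusion-model-based one-shot planner; the view names and surface ids are invented for the example.

```python
def best_next_view(candidates, seen):
    """Greedy next-best-view selection: pick the candidate viewpoint that
    reveals the most not-yet-observed surface elements. `candidates` maps
    a view id to the set of surface ids visible from it; `seen` is the
    set already covered by previous measurements."""
    gain = {view: len(visible - seen) for view, visible in candidates.items()}
    return max(gain, key=gain.get)

# Hypothetical visibility sets for three candidate camera poses.
views = {"front": {1, 2, 3}, "side": {3, 4, 5, 6}, "top": {1, 6}}
seen = {1, 2, 3}                     # surfaces covered so far
nxt = best_next_view(views, seen)    # "side" adds {4, 5, 6}
```

Iterative schemes like this replan after every measurement; the paper's one-shot formulation instead predicts all object-specific views up front from a diffusion-model estimate of the object's geometry.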


Foundation Feature-Driven Online End-Effector Pose Estimation: A Marker-Free and Learning-Free Approach

Wu, Tianshu, Zhang, Jiyao, Liang, Shiqian, Han, Zhengxiao, Dong, Hao

arXiv.org Artificial Intelligence

Accurate transformation estimation between camera space and robot space is essential. Traditional methods using markers for hand-eye calibration require offline image collection, limiting their suitability for online self-calibration. Recent learning-based robot pose estimation methods, while advancing online calibration, struggle with cross-robot generalization and require the robot to be fully visible. This work proposes a Foundation feature-driven online End-Effector Pose Estimation (FEEPE) algorithm, characterized by its training-free and cross end-effector generalization capabilities. Inspired by the zero-shot generalization capabilities of foundation models, FEEPE leverages pre-trained visual features to estimate 2D-3D correspondences derived from the CAD model and target image, enabling 6D pose estimation via the PnP algorithm. To resolve ambiguities from partial observations and symmetry, a multi-historical key frame enhanced pose optimization algorithm is introduced, utilizing temporal information for improved accuracy. Compared to traditional hand-eye calibration, FEEPE enables marker-free online calibration. Unlike robot pose estimation, it generalizes across robots and end-effectors in a training-free manner. Extensive experiments demonstrate its superior flexibility, generalization, and performance.
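The 2D-3D correspondence pipeline described above feeds a PnP solver, which seeks the pose minimizing reprojection error. A minimal sketch of that objective, with rotation omitted for brevity and illustrative intrinsics, is shown below; a real implementation would use a full 6D pose and a library solver.

```python
import math

def project(point3d, pose_t, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a CAD-model point translated by pose_t into
    the camera frame (rotation omitted for brevity; intrinsics are
    illustrative). PnP recovers the pose that best aligns such
    projections with the matched 2D image points."""
    X, Y, Z = (p + t for p, t in zip(point3d, pose_t))
    return (fx * X / Z + cx, fy * Y / Z + cy)

def reprojection_error(matches, pose_t):
    """Mean pixel distance between observed 2D points and 3D model points
    projected under the candidate translation pose_t."""
    errs = [math.dist(uv, project(xyz, pose_t)) for xyz, uv in matches]
    return sum(errs) / len(errs)

# Synthetic correspondences generated from a known true translation.
pts = [(0.1, 0.0, 0.0), (-0.1, 0.1, 0.0)]
true_t = (0.0, 0.0, 2.0)
matches = [(p, project(p, true_t)) for p in pts]
err_true = reprojection_error(matches, true_t)          # zero at the truth
err_wrong = reprojection_error(matches, (0.0, 0.0, 2.5))  # nonzero elsewhere
```

FEEPE's contribution lies upstream of this step: pre-trained foundation features supply the 2D-3D matches without markers or robot-specific training, and multi-frame optimization disambiguates symmetric or partially visible end-effectors.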


Close-up-GS: Enhancing Close-Up View Synthesis in 3D Gaussian Splatting with Progressive Self-Training

Xia, Jiatong, Liu, Lingqiao

arXiv.org Artificial Intelligence

3D Gaussian Splatting (3DGS) has demonstrated impressive performance in synthesizing novel views after training on a given set of viewpoints. However, its rendering quality deteriorates when the synthesized view deviates significantly from the training views. This decline occurs due to (1) the model's difficulty in generalizing to out-of-distribution scenarios and (2) challenges in interpolating fine details caused by substantial resolution changes and occlusions. A notable case of this limitation is close-up view generation: producing views that are significantly closer to the object than those in the training set. To tackle this issue, we propose a novel approach for close-up view generation based on progressively training the 3DGS model with self-generated data. Our solution is based on three key ideas. First, we leverage the See3D model, a recently introduced 3D-aware generative model, to enhance the details of rendered views. Second, we propose a strategy to progressively expand the "trust regions" of the 3DGS model and update a set of reference views for See3D. Finally, we introduce a fine-tuning strategy to carefully update the 3DGS model with training data generated from the above schemes. We further define metrics for close-up view evaluation to facilitate better research on this problem. By conducting evaluations on specifically selected scenarios for close-up views, our proposed approach demonstrates a clear advantage over competitive solutions.
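The progressive "trust region" expansion can be pictured as a schedule of intermediate camera distances, each stage moving only slightly beyond views the current model renders well. Geometric interpolation is an illustrative choice here, not the paper's exact schedule.

```python
def camera_distance_schedule(train_dist, target_dist, stages):
    """Shrink the camera-to-object distance from the nearest training view
    toward the desired close-up over several stages, so each stage's
    self-generated training views stay near the model's current trust
    region. Uses geometric interpolation (an assumed, illustrative
    schedule): each stage scales distance by a constant ratio."""
    ratio = (target_dist / train_dist) ** (1.0 / stages)
    return [train_dist * ratio ** k for k in range(1, stages + 1)]

# Training views sit ~4 units from the object; the target close-up is 0.5.
dists = camera_distance_schedule(4.0, 0.5, stages=3)  # halves each stage
```

At each stage, views rendered at the new distance would be refined (e.g., by a generative model such as See3D) and fed back as training data before the next, closer stage.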


Novel Object 6D Pose Estimation with a Single Reference View

Liu, Jian, Sun, Wei, Zeng, Kai, Zheng, Jin, Yang, Hui, Wang, Lin, Rahmani, Hossein, Mian, Ajmal

arXiv.org Artificial Intelligence

Existing novel object 6D pose estimation methods typically rely on CAD models or dense reference views, which are both difficult to acquire. Using only a single reference view is more scalable, but challenging due to large pose discrepancies and limited geometric and spatial information. To address these issues, we propose a Single-Reference-based novel object 6D (SinRef-6D) pose estimation method. Our key idea is to iteratively establish point-wise alignment in the camera coordinate system based on state space models (SSMs). Specifically, iterative camera-space point-wise alignment can effectively handle large pose discrepancies, while our proposed RGB and Points SSMs can capture long-range dependencies and spatial information from a single view, offering linear complexity and superior spatial modeling capability. Once pre-trained on synthetic data, SinRef-6D can estimate the 6D pose of a novel object using only a single reference view, without requiring retraining or a CAD model. Extensive experiments on six popular datasets and real-world robotic scenes demonstrate that we achieve on-par performance with CAD-based and dense reference view-based methods, despite operating in the more challenging single reference setting. Code will be released at https://github.com/CNJianLiu/SinRef-6D.
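The idea of iteratively refining a point-wise alignment in the camera coordinate system can be illustrated with a toy translation-only analogue. The paper's method predicts updates with learned state space models over RGB and point features; the closed-form mean-residual step below is only a schematic stand-in.

```python
def align_translation(src, dst, iters=5):
    """Iteratively align source points to target points by repeatedly
    applying the mean residual as a translation update. A toy analogue of
    iterative camera-space point-wise alignment; the full problem also
    estimates rotation, and SinRef-6D predicts updates with learned SSMs
    rather than this closed-form step."""
    tx = ty = tz = 0.0
    for _ in range(iters):
        res = [(dx - (sx + tx), dy - (sy + ty), dz - (sz + tz))
               for (sx, sy, sz), (dx, dy, dz) in zip(src, dst)]
        n = len(res)
        tx += sum(r[0] for r in res) / n
        ty += sum(r[1] for r in res) / n
        tz += sum(r[2] for r in res) / n
    return (tx, ty, tz)

# Target points are the source shifted by a known offset (0.3, -0.1, 0.5).
src = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.0, 0.2, 1.1)]
dst = [(x + 0.3, y - 0.1, z + 0.5) for x, y, z in src]
t = align_translation(src, dst)
```

For a pure translation the update converges immediately; the iterative structure matters when, as in the paper's setting, each step must also absorb large rotational discrepancies between the single reference view and the target view.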